
23 research outputs found

Format Abstraction for Sparse Tensor Algebra Compilers

Author: Amarasinghe Saman
Chou Stephen
Kjolstad Fredrik
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/11/2018

This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form the dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants thereof. With these implementations at hand, our code generator can generate code to compute any tensor algebra expression on any combination of the aforementioned formats. To demonstrate our technique, we have implemented it in the taco tensor algebra compiler. Our modular code generator design makes it simple to add support for new tensor formats, and the performance of the generated code is competitive with hand-optimized implementations. Furthermore, by extending taco to support a wider range of formats specialized for different application and data characteristics, we can improve end-user application performance. For example, if input data is provided in the COO format, our technique allows computing a single matrix-vector multiplication directly with the data in COO, which is up to 3.6× faster than by first converting the data to CSR. Comment: Presented at OOPSLA 201
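The COO-versus-CSR trade-off mentioned in this abstract can be illustrated with a minimal hand-written sketch. The function names (`spmv_coo`, `coo_to_csr`, `spmv_csr`) and the plain-list data layout are assumptions made for illustration; this is not taco's API or the code it generates. Computing directly on COO takes one pass over the nonzeros, while the CSR route first pays for a conversion pass.

```python
def spmv_coo(n_rows, rows, cols, vals, x):
    """y = A @ x with A stored as unsorted COO triples (hypothetical layout)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO to CSR (pos, crd, vals) with a counting sort on the row index."""
    pos = [0] * (n_rows + 1)
    for r in rows:
        pos[r + 1] += 1
    for i in range(n_rows):
        pos[i + 1] += pos[i]
    crd = [0] * len(vals)
    out = [0.0] * len(vals)
    nxt = list(pos[:-1])  # next free slot in each row segment
    for r, c, v in zip(rows, cols, vals):
        k = nxt[r]
        crd[k] = c
        out[k] = v
        nxt[r] += 1
    return pos, crd, out


def spmv_csr(pos, crd, vals, x):
    """y = A @ x with A stored as CSR."""
    y = [0.0] * (len(pos) - 1)
    for i in range(len(pos) - 1):
        for k in range(pos[i], pos[i + 1]):
            y[i] += vals[k] * x[crd[k]]
    return y


# A = [[1, 0, 2], [0, 0, 3], [4, 5, 0]] given as unsorted COO triples.
rows, cols, vals = [2, 0, 1, 0, 2], [0, 0, 2, 2, 1], [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y_direct = spmv_coo(3, rows, cols, vals, x)
y_via_csr = spmv_csr(*coo_to_csr(3, rows, cols, vals), x)
```

For a single multiplication, the conversion in `coo_to_csr` touches every nonzero before any arithmetic happens, which is the overhead the abstract says can be avoided by operating on COO directly.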

arXiv.org e-Print Archive

DSpace@MIT

Sparse Tensor Transpositions

Author: Ahrens Peter
Amarasinghe Saman
Chou Stephen
Kjolstad Fredrik
Mueller Suzanne
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/05/2020

We present a new algorithm for transposing sparse tensors called Quesadilla. The algorithm converts the sparse tensor data structure to a list of coordinates and sorts it with a fast multi-pass radix algorithm that exploits knowledge of the requested transposition and the tensor's input partial coordinate ordering to provably minimize the number of parallel partial sorting passes. We evaluate both a serial and a parallel implementation of Quesadilla on a set of 19 tensors from the FROSTT collection, a set of tensors taken from scientific and data analytic applications. We compare Quesadilla and a generalization, Top-2-sadilla, to several state-of-the-art approaches, including the tensor transposition routine used in the SPLATT tensor factorization library. In serial tests, Quesadilla was the best strategy for 60% of all tensor and transposition combinations and improved over SPLATT by at least 19% in half of the combinations. In parallel tests, at least one of Quesadilla or Top-2-sadilla was the best strategy for 52% of all tensor and transposition combinations. Comment: This work will be the subject of a brief announcement at the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '20
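As a point of reference for this abstract, the sketch below shows the naive baseline that a pass-minimizing algorithm improves on: permute each coordinate tuple, then run one stable bucket pass per mode, least significant first. It is not Quesadilla itself, which (per the abstract) uses the requested transposition and the existing partial ordering to skip passes. The names `stable_pass` and `transpose_coo` are hypothetical.

```python
def stable_pass(entries, mode):
    """One stable bucket (counting) pass on a single coordinate mode."""
    dim = max(e[0][mode] for e in entries) + 1
    buckets = [[] for _ in range(dim)]
    for e in entries:
        buckets[e[0][mode]].append(e)
    return [e for b in buckets for e in b]


def transpose_coo(coords, vals, perm):
    """Transpose a coordinate-list tensor.

    perm[i] names the old mode that becomes new mode i. The result is
    sorted lexicographically in the new mode order.
    """
    entries = [(tuple(c[m] for m in perm), v) for c, v in zip(coords, vals)]
    if not entries:
        return [], []
    # Naive LSD radix sort: one full pass per mode, last mode first.
    for mode in reversed(range(len(perm))):
        entries = stable_pass(entries, mode)
    return [c for c, _ in entries], [v for _, v in entries]


coords = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
vals = [1, 2, 3, 4]
new_coords, new_vals = transpose_coo(coords, vals, perm=(2, 0, 1))
```

This baseline always performs as many passes as the tensor has modes, regardless of how much of the input ordering already agrees with the requested output ordering.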

arXiv.org e-Print Archive

Crossref

Compiling Recurrences over Dense and Sparse Arrays

Author: Kjolstad Fredrik
Sundram Shiv
Tariq Muhammad Usman
Publication date: 08/09/2023

Recurrence equations lie at the heart of many computational paradigms including dynamic programming, graph analysis, and linear solvers. These equations are often expensive to compute and much work has gone into optimizing them for different situations. The set of recurrence implementations is a large design space across the set of all recurrences (e.g., the Viterbi and Floyd-Warshall algorithms), the choice of data structures (e.g., dense and sparse matrices), and the set of different loop orders. Optimized library implementations do not exist for most points in this design space, and developers must therefore often manually implement and optimize recurrences. We present a general framework for compiling recurrence equations into native code corresponding to any valid point in this general design space. In this framework, users specify a system of recurrences, the type of data structures for storing the input and outputs, and a set of scheduling primitives for optimization. A greedy algorithm then takes this specification and lowers it into a native program that respects the dependencies inherent to the recurrence equation. We describe the compiler transformations necessary to lower this high-level specification into native parallel code for both sparse and dense data structures and provide an algorithm for determining whether the recurrence system is solvable with the provided scheduling primitives. We evaluate the performance and correctness of the generated code on various computational tasks from domains including dense and sparse matrix solvers, dynamic programming, graph problems, and sparse tensor algebra. We demonstrate that generated code has competitive performance to handwritten implementations in libraries
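To make one point in this design space concrete, here is a hand-written sketch of the Floyd-Warshall recurrence named in the abstract, on a dense matrix with one fixed loop order. It is an illustration only, not output of the framework described; the function name and edge-list input format are assumptions. The recurrence is d_k[i][j] = min(d_{k-1}[i][j], d_{k-1}[i][k] + d_{k-1}[k][j]), so the `k` loop must be outermost for the in-place update to respect the dependencies.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest paths on a dense n x n matrix.

    edges is a list of (source, target, weight) triples (hypothetical format).
    """
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    # k outermost: step k only reads values that steps 0..k-1 have finalized.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


dist = floyd_warshall(3, [(0, 1, 4.0), (1, 2, 1.0), (0, 2, 10.0)])
```

Swapping the data structure (for example, a sparse adjacency structure) or the loop order yields a different implementation of the same recurrence, which is the variation the abstract describes as the design space.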

arXiv.org e-Print Archive

The Tensor Algebra Compiler

Author: Amarasinghe Saman
Chou Stephen
Kamil Shoaib
Kjolstad Fredrik
Lugato David
Publication date: 21/02/2017

Tensor and linear algebra is pervasive in data analytics and the physical sciences. Often the tensors, matrices or even vectors are sparse. Computing expressions involving a mix of sparse and dense tensors, matrices and vectors requires writing kernels for every operation and combination of formats of interest. The number of possibilities is infinite, which makes it impossible to write library code for all. This problem cries out for a compiler approach. This paper presents a new technique that compiles compound tensor algebra expressions combined with descriptions of tensor formats into efficient loops. The technique is evaluated in a prototype compiler called taco, demonstrating competitive performance to best-in-class hand-written codes for tensor and matrix operations
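The kind of format-specific loop this abstract refers to can be sketched by hand for one small case: the element-wise product a_i = b_i * c_i where both operands are sparse vectors stored as sorted index and value arrays. The function name and storage layout are assumptions for illustration and do not represent taco's generated code. Because a product is nonzero only where both operands are, the loop walks the two index lists together and emits only the intersection.

```python
def sparse_elementwise_mul(b_idx, b_vals, c_idx, c_vals):
    """Multiply two sparse vectors given as sorted (index, value) arrays."""
    a_idx, a_vals = [], []
    pb, pc = 0, 0
    while pb < len(b_idx) and pc < len(c_idx):
        ib, ic = b_idx[pb], c_idx[pc]
        if ib == ic:
            # Both operands store this coordinate, so the product is stored too.
            a_idx.append(ib)
            a_vals.append(b_vals[pb] * c_vals[pc])
            pb += 1
            pc += 1
        elif ib < ic:
            pb += 1  # coordinate only in b: product is zero, skip it
        else:
            pc += 1  # coordinate only in c: product is zero, skip it
    return a_idx, a_vals


a_idx, a_vals = sparse_elementwise_mul(
    [0, 2, 5], [1.0, 2.0, 3.0],
    [2, 3, 5], [10.0, 20.0, 30.0],
)
```

Changing the operation to addition, or one operand to a dense array, changes the loop structure, which is why writing every combination by hand does not scale.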

DSpace@MIT

Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture

Author: Hsu Olivia
Kjolstad Fredrik
Olukotun Kunle
Rucker Alexander
Zhao Tian
Publication date: 06/11/2022

We introduce Stardust, a compiler that compiles sparse tensor algebra to reconfigurable dataflow architectures (RDAs). Stardust introduces new user-provided data representation and scheduling language constructs for mapping to resource-constrained accelerated architectures. Stardust uses the information provided by these constructs to determine on-chip memory placement and to lower to the Capstan RDA through a parallel-patterns rewrite system that targets the Spatial programming model. The Stardust compiler is implemented as a new compilation path inside the TACO open-source system. Using cycle-accurate simulation, we demonstrate that Stardust can generate more Capstan tensor operations than its authors had implemented and that it results in 138× better performance than generated CPU kernels and 41× better performance than generated GPU kernels. Comment: 15 pages, 13 figures, 6 tables

arXiv.org e-Print Archive

Compiler Support for Sparse Tensor Computations in MLIR

Author: Bik Aart J. C.
Kjolstad Fredrik
Koanantakool Penporn
Shpeisman Tatiana
Vasilache Nicolas
Zheng Bixia
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/02/2022

Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Developing and maintaining sparse software by hand, however, is a complex and error-prone task. Therefore, we propose treating sparsity as a property of tensors, not a tedious implementation task, and letting a sparse compiler generate sparse code automatically from a sparsity-agnostic definition of the computation. This paper discusses integrating this idea into MLIR

arXiv.org e-Print Archive

The Sparse Abstract Machine

Author: Emer Joel
Horowitz Mark
Hsu Olivia
Kjolstad Fredrik
Olukotun Kunle
Sharma Ritvik
Strange Maxwell
Won Jaeyeon
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 23/03/2023

We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates SAM's usefulness as an intermediate representation. We automatically bind from SAM to a streaming dataflow simulator. We evaluate the generality and extensibility of SAM, explore the performance space of sparse tensor algebra optimizations using SAM, and show SAM's ability to represent dataflow hardware. Comment: 18 pages, 17 figures, 3 table

arXiv.org e-Print Archive

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Author: Ejjeh Adel
Hellsten Erik
Hsu Olivia
Kjolstad Fredrik
Lacouture Rubens
Lenfers Johannes
Nardi Luigi
Olukotun Kunle
Souza Artur
Steuwer Michel
Publication date: 11/04/2023

We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. Particularly, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized towards the autotuning domain. We demonstrate BaCO's effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For these domains, BaCO outperforms current state-of-the-art autotuners by delivering on average 1.36x-1.56x faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.9x-3.9x faster

arXiv.org e-Print Archive